ABSTRACT

This paper investigates the reliability of the 
Florida Performance Measurement System's Summative Observation
instrument. Developed for the Florida Beginning Teacher Evaluation 
Program, it provides behavioral ratings for teachers in a classroom 
setting. Data came from ratings of videotapes of nine teachers 
conducting actual lessons by nine teams of trained observers. 
Analysis of variance produced three estimates of reliability for each 
scale and subscale: (1) discriminant (across teachers); (2) stability 
(over time); and (3) interrater (among raters). Results indicate that 
the instrument appears sufficiently reliable to conduct classroom 
observations if ratings by at least two different observers are 
averaged to produce scores. Effective (positive) indicators of 
teacher behavior appear to be more reliably observed than ineffective 
(negative) indicators. Two domains, Management of Student Conduct
and Communication: Verbal and Nonverbal, appear too intercorrelated
with the other domains for discrete reliable estimation of specific 
behaviors. Future research on this instrument should include 
validation, rater certification, norming and frame factors. 
Appendices contain: (1) background information on the knowledge base 
and the Florida teacher competencies; (2) indicators of the summative
instrument; and (3) computation of reliability estimates. (BS) 
Summary



The study conducted on the reliability of the Florida Per- 
formance Measurement System's (FPMS) Summative Observation 
Instrument supports the following conclusions: 



* The instrument appears sufficiently reliable to conduct
classroom observations if ratings by at least two
different observers are averaged to produce scores.

* Effective (positive) indicators of teacher behavior 
appear to be more amenable to reliable observation 
than ineffective (negative) indicators.



* Two domains, 2 - Management of Student Conduct, and
5 - Communication: Verbal and Nonverbal, appear too
intercorrelated with the other domains for discrete,
reliable estimation of specific behaviors. These 
domains should probably be used more as indicators 
than as judgement tools. 



Acknowledgements



This paper is an attempt to report the work of numerous 
members of the Florida Educational Community. The Florida
Performance Measurement System has been in development for 
several years supported largely by volunteer efforts. 

A core group consisting of B. Othanel Smith, Donovan
Peterson, Jean Borg and Betty Fry have been the primary
researchers and instrument developers throughout the project's
history. Robert Soar of the University of Florida, Donald 
Medley of the University of Virginia and Joseph Nazur of the 
University of South Florida were major contributors to the 
design of the reliability study conducted on the Summative
Observation Instrument. In addition to these, Dave Berliner
of the University of Arizona, and N.L. Gage of Stanford
contributed to the content validation of the Domain
Documents.

Donovan Peterson, B. Othanel Smith, Donald Medley, and 
Garfield Wilson provided contributions to the final form of 
this paper.
Introduction



The following study was conducted to support the reliability 
of an observation instrument developed to provide behavioral 
ratings for teachers in a classroom setting. Such ratings, 
if they prove feasible, may be used for problem identifica- 
tion, remediation and evaluation. 

Classroom observations appear a necessity for evaluating 
teacher performance, since student outcome variables have 
not proven useful for this purpose. Observation techniques
and instruments, however, are notoriously unreliable,
suffering from such problems as rater inconsistency, lack of
objectivity, unclear item definitions and changes over time
in both observers and subjects. For these reasons, extensive
and rigorous tests of reliability are recommended prior to the use
of any observation instrument.

An observation instrument was developed for the Florida Be- 
ginning Teacher Evaluation Program, and tests of reliability 
were conducted to answer the question: "How justified are 
the researchers in generalizing the reliability estimates 
from this study to situations other than the one in which 
these estimates were obtained?" The following three types
of reliability were investigated to answer this question
using a three way ANOVA model (Medley, 1982; Medley and
Mitzel, 1963; Cronbach, 1972; Shrout, 1979):

1. DISCRIMINANT - consistency over subjects, 

2. STABILITY - consistency over time, and 

3. INTERRATER - consistency among raters. 

Adequate outcomes within these three areas of reliability 
allow for generalization of the results to various teachers, 
observers, and teaching situations. 



Instrument Development


The overall purpose of this research was to identify key el- 
ements that relate positively to student achievement. Bas- 
ing their work on Florida legislation (section 231.29), a 
team of four education specialists conducted an extensive 
search of the process-product research literature and iden- 
tified four observable domains of teacher behavior that con- 
sistently correlate with student achievement and also appear 
amenable to specification on an observation instrument. 

DOMAIN 2 Management of Student Conduct 

DOMAIN 3 Instructional Organization and Development 

DOMAIN 4 Presentation of Subject Matter 

DOMAIN 5 Communication: Verbal and Nonverbal 



Their review indicated not only that positive teacher
behaviors were associated with student achievement, but also
that specific and related negative behaviors appear to
correlate negatively with achievement. Positive behaviors were
termed Effective, and negative, Ineffective. Following
content validation and pilot testing, the final version of a
summative instrument contained twenty indicators of effective
and twenty indicators of ineffective teacher behaviors.
The indicators were couched in behavioral terminology as
much as possible in order to reduce coding ambiguity. Each
of the four domains identified above was represented in the
instrument, though not proportionally. The final form of
this instrument contained:



DOMAIN 2      2 items
DOMAIN 3     11 items
DOMAIN 4      4 items
DOMAIN 5      3 items



An extensive study combining the expertise of university 
personnel, school district personnel and practicing teachers 
was conducted throughout the state of Florida during 1982-83 
to clarify, content validate and test the reliability of 
this summative instrument. 



Validation Procedures



Content Validity. The content validity of this instrument
was supported by: 

1. multiple independent sources for item development, 

2. use of only research based indicators of teacher
effectiveness for item development (indicators associated
in the research with student achievement measures),

3. criticism and suggestions from knowledgeable persons 
external to the development, and 

4. input from nationally known experts in the fields of 
teacher effectiveness, educational research and ob- 
servation instruments. 

Concurrent Validity. In an attempt to compare ratings
obtained using the scaled scores from the summative instrument
with ratings of knowledgeable persons regarding teacher
behavior, two such people rated each of the teachers in the
reliability study on a scale from 1 (inadequate) to 4
(excellent) for each of the domains, and for each of their
lessons as a whole. Due to the low level of measurement
embodied in this rating, a Spearman rank order correlation was
run between these ratings and the standardized scores
obtained during the reliability study for each tape. There was
a significant positive relationship between the ratings of
these experts and the total scores for each tape on the
summative instrument (r = .55). However, there was no
significant relationship between any of the experts' domain
subscale ratings and those of the instrument. This indicates
either that the scale is more precise in evaluating specific
domains, or that individual domains lack sufficient items to
allow for specific evaluation using this instrument. Until
this is resolved, only the total score should be used for
decision-oriented evaluation. More study must be conducted
before a reasonable understanding of the phenomenon may be
proposed.
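
The rank order correlation used for this comparison can be sketched as follows. This is a minimal illustration, not the computation actually run for the study: it assumes tied ratings receive average ranks, and the function names and sample values are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank order correlation: Pearson r computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: expert ratings (1-4) and instrument totals for five tapes.
expert = [1, 2, 2, 3, 4]
total = [82.0, 95.5, 90.0, 104.0, 110.5]
print(round(spearman(expert, total), 3))
```

Ranking both variables first is what makes the coefficient suitable for a four-point rating of this kind, since only the ordering of the ratings is used.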
Design

This study was designed to produce three intraclass correla- 
tion estimates of reliability derived from a three way Anal- 
ysis of Variance (Medley 1982). The following main effect 
sources of variance are identified using ANOVA: 

1. TEACHERS - variance among nine different teachers,

2. LESSONS - variance between two separate lessons
taught by each of the nine teachers, and

3. RATERS - variance among nine teams of raters
(observers).

These three main effects combined with various interaction
effects are used to produce estimates of the following forms
of reliability:

1. DISCRIMINANT - The consistency with which a test dif- 
ferentiates between different subjects (teachers) on 
a specific scale. 

If the instrument does not reliably discriminate
among teachers having different behaviors, it cannot
be used to evaluate levels of behavior.

2. INTERRATER - The consistency with which different
raters score the same behavior exhibited by the same
subject (teacher).

If different raters do not produce consistent
scores for the same teacher on the same lesson, one
can assume either that the items comprising the
instrument are ambiguous, or that the raters are not
adequately trained.

3. STABILITY - The consistency with which a specific
subject (teacher) exhibits the same or similar behavior
at two different times.

Teachers were observed while teaching two lessons
different in content, but similar in format. The
teaching behaviors should not be altered substantially
by changing the content, as long as the lesson format
remains the same. Variance accounted for by the
interaction of raters and lessons, as well as that
resulting from the teacher and lesson interaction,
provides an estimate of the stability of the teaching act
over time.

Nine teachers were each observed teaching two lessons (a
total of 18 lessons) by nine teams of raters. Eighteen video
tapes of actual teachers' classroom behaviors were created,
observed by the raters, and scaled into five separate
scores: one total summative score and four subscales, one
for each domain included in the summative instrument.



Scaling 

In order to develop summative scores capable of rating
teachers from HI to LOW on behaviors, a total scale based
upon normalized scores for each individual item was created
(see Data Transformations for transformation procedures
applied). Each item on the instrument was standardized to a
mean of 5, so that the mean total score was 100 (20 items
times an average score of 5). One total scale for the 20
positive items and one for the 20 negative items were
created. In addition, subscales were developed for each domain
contained in the instrument. As a result of factor analysis
conducted upon a preliminary version of this instrument, two
items (#1 - Begins instruction promptly, and #11 -
Circulates and assists students) were transferred from the Domain
3 subscale to the Domain 2 subscale. Thus, the final
version of Domain 2 contains four items instead of two, and
Domain 3 includes nine items instead of eleven. The scores in
this report consist of the summed normalized scores produced
by teams of three observers for the following scales (see
Appendix B for item content):
TABLE 1
Item Content of Scales

SCALE NAME    NUMBER OF ITEMS    ITEMS INCLUDED

TOTAL         20 items           1 thru 20
Domain 2      4 items            1, 11, 19, 20
Domain 3      9 items            2 thru 10
Domain 4      4 items            12 thru 15
Domain 5      3 items            16 thru 18



Scaling Concerns. One major concern while developing
scales for this instrument is the unusual nature of items 19 
and 20 in Domain 2. 

1. ITEM 19 — stops misconduct 

2. ITEM 20 — maintains instructional momentum 

Since these items deal specifically with control of
behavior, their scale points have different meanings than other
items. For the other 18 effective indicators, the more
items marked, the higher a teacher's score. For items 19
and 20, however, the best possible score would be zero, as
this would indicate a class under perfect control, requiring
no teacher intervention to maintain momentum. The second
best score would show only effective behaviors, and of
course, the worst, only ineffective. This could cause some
problems when interpreting a summative total score on the
instrument. In practice, however, over several pilot tests,
no teachers in the lower levels (below high school)
exhibited zero behaviors for these items. For all practical
purposes, the scales may therefore be considered uniform across
all items, at least at the present time. The applicability
of Domain 4 (Presentation of Subject Matter) to all levels
of teaching poses another possible scaling problem. Some
hypothesize that the use of this domain will vary
substantially across grade levels, appearing only rarely in the
early grades, and becoming successively more common as grade
level rises. It will be necessary to obtain substantially
more data than are currently available to make decisive
statements regarding this question.

Team Scores. Team scores were created by taking the mean
transformed item scores for each team as the unit of
analysis. These team item scores were then summed separately for
the total instrument and each domain. Separate three way
independent Analyses of Variance were conducted on each of
the five resulting scales (total instrument, Domain 2,
Domain 3, Domain 4, and Domain 5).
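
The construction of team scores can be sketched as below. This is a minimal illustration under stated assumptions: the item groupings follow Table 1, while the function names, the dictionary layout and the sample scores are hypothetical.

```python
from statistics import mean

# Item composition of each scale, as listed in Table 1.
SCALES = {
    "TOTAL": list(range(1, 21)),
    "Domain 2": [1, 11, 19, 20],
    "Domain 3": list(range(2, 11)),
    "Domain 4": list(range(12, 16)),
    "Domain 5": list(range(16, 19)),
}

def team_item_scores(observer_scores):
    """Average each transformed item score over the observers on one team.

    observer_scores: one dict per observer, mapping item number -> score.
    """
    items = observer_scores[0].keys()
    return {item: mean(obs[item] for obs in observer_scores) for item in items}

def scale_scores(team_items):
    """Sum the team item scores into the total and domain scale scores."""
    return {name: sum(team_items[i] for i in items)
            for name, items in SCALES.items()}

# Hypothetical team of three observers scoring every item 4, 5 and 6.
team = [{item: score for item in range(1, 21)} for score in (4.0, 5.0, 6.0)]
print(scale_scores(team_item_scores(team)))
```

With every item averaging 5, the total comes to 100, which matches the scaling described earlier (20 items times an average score of 5).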



Reliability Estimates 

Within reliability theory, the generally accepted definition 
of an obtained score is as follows: 

O = T + E

where O = Obtained score 
T = True score 
E = Error score. 

Reliability estimates are generally formulated in the
following fashion using the elements of the preceding model:

r = 1 - (E/O)

This estimate separates the True score variance from the
Error score variance present in the obtained score (Nunnally,
1978). The resulting reliability estimates are limited by
the number of sources of variance identified by the model in
use. From the sums of squares generated by the independent
ANOVA, it is possible, using intraclass correlations, to
create estimates of reliability that do not rest upon the
rigorous set of assumptions underlying the F or t
statistics (Lindeman, 1978; Medley and Mitzel, 1963; Cronbach,
1972; Guilford and Fruchter, 1973).



This model uses the variance estimates produced by a three 
way ANOVA to identify sources of error across teachers,
between lessons and among raters, as well as the interactions
of these components. The variance estimates are then en- 
tered into an appropriate formula generated by classical re- 
liability theory to obtain three separate and quite differ- 
ent estimates of reliability based on a single error term. 
Each of these estimates provides information regarding the 
generalizability of the results from this study. 



TABLE 2 

Independent ANOVA Sources for Sums of Squares 
SOURCE                                      NUMBER   df

MAIN EFFECTS
(SST) Teachers                              N        N-1
(SSR) Raters                                K        K-1
(SSL) Lessons                               D        D-1

INTERACTION EFFECTS
(SSTXL) Teachers with Lessons                        (N-1)(D-1)
(SSTXR) Teachers with Raters                         (N-1)(K-1)
(SSRXL) Raters with Lessons                          (K-1)(D-1)

RESIDUAL EFFECTS (ERROR)
(SSRXLXT) Raters with Teachers with Lessons          (N-1)(K-1)(D-1)



Within the analysis conducted, error is considered that
variance which is either unexplained or unaccountable.
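
The sums of squares named in Table 2 can be computed as in the sketch below. It is an illustration only, assuming a fully crossed design with a single team score in every teacher by rater by lesson cell; the function name and the nested-list layout are hypothetical, and the study itself used packaged statistical software.

```python
def three_way_ss(x):
    """Sums of squares for a crossed design with one score per cell.

    x[t][r][l] is the score for teacher t, rater team r and lesson l.
    N, K and D follow the notation of Table 2.
    """
    N, K, D = len(x), len(x[0]), len(x[0][0])
    T, R, L = range(N), range(K), range(D)
    g = sum(x[t][r][l] for t in T for r in R for l in L) / (N * K * D)
    # Marginal means for each main effect and each two way table.
    mT = [sum(x[t][r][l] for r in R for l in L) / (K * D) for t in T]
    mR = [sum(x[t][r][l] for t in T for l in L) / (N * D) for r in R]
    mL = [sum(x[t][r][l] for t in T for r in R) / (N * K) for l in L]
    mTR = [[sum(x[t][r][l] for l in L) / D for r in R] for t in T]
    mTL = [[sum(x[t][r][l] for r in R) / K for l in L] for t in T]
    mRL = [[sum(x[t][r][l] for t in T) / N for l in L] for r in R]
    ss = {
        "SST": K * D * sum((mT[t] - g) ** 2 for t in T),
        "SSR": N * D * sum((mR[r] - g) ** 2 for r in R),
        "SSL": N * K * sum((mL[l] - g) ** 2 for l in L),
        "SSTXR": D * sum((mTR[t][r] - mT[t] - mR[r] + g) ** 2
                         for t in T for r in R),
        "SSTXL": K * sum((mTL[t][l] - mT[t] - mL[l] + g) ** 2
                         for t in T for l in L),
        "SSRXL": N * sum((mRL[r][l] - mR[r] - mL[l] + g) ** 2
                         for r in R for l in L),
    }
    total = sum((x[t][r][l] - g) ** 2 for t in T for r in R for l in L)
    # The residual is whatever the main effects and two way terms leave over.
    ss["SSRXLXT"] = total - sum(ss.values())
    return ss
```

For the design reported here, x would hold 9 teachers by 9 rater teams by 2 lessons, so that N = 9, K = 9 and D = 2.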
TABLE 3

Resulting Reliability Estimates

TRUE SCORE = SST - SSTXL - SSTXR + SSRXTXL

ERROR SCORE = SSTXL + SSTXR + SSRXTXL

OBTAINED SCORE = TRUE SCORE + ERROR SCORE

ERROR for Stability = ERROR(ST) = SSTXL + SSRXTXL

ERROR for Rater Consistency = ERROR(R) = SSTXR + SSRXTXL

1. DISCRIMINANT (between teachers)   = 1 - ERROR / OBTAINED
2. STABILITY (over time)             = 1 - ERROR(ST) / OBTAINED
3. INTER-TEAM (among observers)      = 1 - ERROR(R) / OBTAINED
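
The three estimates in Table 3 reduce to a few lines of arithmetic, sketched below. Two things are assumptions rather than facts from the report: the three components of the overall error score are taken to be added together, and the sums of squares in the example are invented purely for illustration.

```python
def reliability_estimates(sst, sstxl, sstxr, ssrxtxl):
    """Return the three Table 3 reliability estimates from sums of squares."""
    true_score = sst - sstxl - sstxr + ssrxtxl
    error = sstxl + sstxr + ssrxtxl  # assumed: the components are added
    obtained = true_score + error
    return {
        "discriminant": 1 - error / obtained,
        "stability": 1 - (sstxl + ssrxtxl) / obtained,   # ERROR(ST)
        "inter_team": 1 - (sstxr + ssrxtxl) / obtained,  # ERROR(R)
    }

# Hypothetical sums of squares, for illustration only.
est = reliability_estimates(sst=100.0, sstxl=5.0, sstxr=3.0, ssrxtxl=2.0)
for name, value in est.items():
    print(f"{name}: {value:.2f}")
```

Because the stability and inter-team estimates each remove only part of the error, they can never fall below the discriminant estimate computed from the same sums of squares.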



Subjects 



Raters. Forty two raters (observers) were trained in four
Florida counties (Lee, Hillsboro, Pasco and Pinellas). The
observers were volunteers, most of whom occupy
administrative or supervisory positions in their county school
districts. Each received five weeks of training (3-4 hours per
week) prior to conducting observations for this study. Of
the 42 total, 27 viewed at least 16 of the 18 video
tapes. In an attempt to stabilize the observation scores,
these 27 observers were randomly divided into teams of three
observers each, and mean scores for teams were used as the
unit of analysis.

Teachers. Nine teachers from Hillsboro and Orange
counties were video taped while conducting actual lessons in a
classroom setting. Teachers of various quality, styles,
experience, grade levels and subject specialties were included
in the study in an attempt to avoid possible bias in
teaching style that might result from voluntary participation.
Observers confirmed that a variety of teacher styles and
quality were present in this study.






RESULTS 



Responses

Seven hundred sixty eight observations were obtained from 42 
observers of the 18 video tapes. Analysis was limited to 27 
observers for whom at least 16 observations were available. 
This allowed for nine teams of three randomly assigned ob- 
servers. 



Scaling 



Data Transformations. The item distributions in this study,
as is typical for observation instruments, were
characterized by extreme non-normality, with skews ranging as high as
5.2. For this reason, area transformations (Soar 1982) were
conducted on the data to normalize the distributions. This
was accomplished within the Statistical Analysis System
(SAS) by first transforming each score to a percentile rank,
to eliminate the extended nature of the tails, then
standardizing the rank transformed data to further normalize the
distributions. Both percentile and normal transformations
were conducted using the SAS Rank subprogram. This resulted
in individual item distributions more closely corresponding
to the theoretical Gaussian (normal), at least with regard
to skew, kurtosis, and the relationship between the standard
deviation and the semi-interquartile range. Team scores
were created from the transformed item scores. The three
individual scores for each team were averaged to create a
mean team score on each item. These item scores were then
summed into scale scores for the total instrument and each
domain separately. See Table 1 for specific scale
composition.
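
A rank based normalizing transformation of this general kind can be sketched as follows. The report does not state which options of the SAS Rank subprogram were used, so the Blom approximation, (rank - 3/8) / (n + 1/4), is an assumption here, as are the function names and the sample counts.

```python
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def normal_scores(values):
    """Map each score to the normal deviate of its (assumed Blom) rank fraction."""
    n = len(values)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    return [unit_normal.inv_cdf((rank - 0.375) / (n + 0.25))
            for rank in average_ranks(values)]

# Hypothetical, strongly skewed frequency counts for one item.
counts = [0, 0, 0, 1, 1, 2, 9]
print([round(z, 2) for z in normal_scores(counts)])
```

The long right tail (the count of 9) is pulled in because only its rank, not its magnitude, enters the transformed score.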

Data Verification. Data obtained from this study were
entered onto disk packs and accessed using the IBM 370 at the
University of South Florida (Tampa, FL). All data were
independently duplicated onto a second data set. The two data
sets were compared; all discrepancies were referred back to
the original observation instruments, and both data sets
were corrected.
Analyses and Specific Results



Reliability of Effective Scales. Separate three way independent
Analyses of Variance were conducted on each of the five
resulting scales (total instrument, Domain 2, Domain 3, Domain
4, and Domain 5). This produced three estimates of
reliability for each scale and subscale:



1. DISCRIMINANT - across teachers, 

2. STABILITY - over time, and 

3. INTERRATER - among raters.

Table 4 gives the results obtained for the five scales
identifying effective indicators of teacher behavior for the
entire study (nine teams, nine teachers, two lessons). As one
can see, these reliability estimates for the entire study
are exceptionally high, with only Domains 2 and 5 having
estimates below .88, and no interrater estimate below .94.
Perhaps a more realistic estimate of the probable
reliability of this instrument in actual practice is given in Table
5, based on two observations by a team of three observers.



TABLE 4 

Reliability Estimates for Five Separate Scales 



Nine Teams of Three Raters 
Observing Two Lessons (27 raters) 



TYPE OF        TOTAL SCALE   DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5
RELIABILITY    20 ITEMS      4 ITEMS    9 ITEMS    4 ITEMS    3 ITEMS

DISCRIMINANT   .91           .60        .89        .91        .63
STABILITY      .92           .61        .90        .94        .63
INTERRATER     .98           .88        .99        .98        .94

TABLE 5

Reliability Estimates for Five Separate Scales



One Team of Three Raters
Observing Two Lessons



TYPE OF        TOTAL SCALE   DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5
RELIABILITY    20 ITEMS      4 ITEMS    9 ITEMS    4 ITEMS    3 ITEMS

DISCRIMINANT   .79           .31        .80        .81        .42
STABILITY      .86           .37        .85        .88        .42
INTERRATER     .85           .45        .89        .85        .63



Table 5 presents the reliability estimates for positive
indicators resulting from a single team of three observers
observing nine teachers each teaching two lessons. The
results indicate a relatively high reliability for the total
scale (20 items), as well as for Domains 3 (9 items) and 4
(4 items). Domain 2 (4 items) exhibits moderate
reliability; however, Domain 5 (3 items) estimates are below generally
accepted levels for reliability. The lower reliability
estimates for Domains 2 and 5 probably result from the small
number of items in these subscales. In addition, Domain 5,
Communication, appears to overlap all other domains, thus
reducing its independence and resulting in increased
ambiguity.

The high estimates for Domains 3, 4 and the Total scale
suggest that they are appropriate for classroom application.
The moderate estimates for Domain 2 suggest caution in its 
use as a subscale for evaluation. Domain 5 should probably 
not be used as a specific subscale based on these results. 



Reliability of Ineffective Scales. The following table shows
the output for the negative scores (ineffective indicators)
from the instrument. Reliability estimates for the
ineffective indicators are consistently lower than for the
effective scores, and all ineffective scales exhibit questionable
levels of reliability.
TABLE 6

Reliability Estimates for Ineffective Scales



One Team of Three Observers
Observing Two Lessons



TYPE OF        TOTAL SCALE   DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5
RELIABILITY    20 ITEMS      2 ITEMS    11 ITEMS   4 ITEMS    3 ITEMS

DISCRIMINANT   .64           .37        .40        .66        .58
STABILITY      .82           .67        .72        .77        .71
INTER-TEAM     .73           .55        .49        .69        .64



Possible reasons for the lower estimates from ineffective 
indicators include: 



1. Far fewer instances of ineffective behaviors occurred 
in the study than of effective behaviors, 

2. Ineffective indicators appear more diffuse (less 
clearly definable) than effective indicators, result- 
ing in greater coding ambiguity, and 

3. Several ineffective items require the observers to
code "missing" behaviors (e.g. circulates inadequately,
delays, etc.). Observers appear to code what the
teacher does more accurately than what the teacher
does not do.



Effects of Multiple Observers. The level of reliability
resulting from using larger or smaller teams of observers was
investigated by computing separate reliability estimates for
randomly selected teams of observers. One team contained
three observers, one contained two observers, and one team
consisted of a single observer.
TABLE 7

Effects of Number of Observers on Reliability Estimates
One Team of Three Observers / Two Lessons



NUMBER OF      TOTAL SCALE   DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5
OBSERVERS      20 ITEMS      4 ITEMS    9 ITEMS    4 ITEMS    3 ITEMS

DISCRIMINANT
THREE          .79           .61        .80        .81        .42
TWO            .75           .49        .76        .77        .38
ONE            .55           .25        .58        .54        .15

STABILITY
THREE          .86           .64        .85        .88        .42
TWO            .81           .55        .80        .85        .54
ONE            .70           .26        .77        .72        .37

INTER-TEAM
THREE          .85           .66        .87        .85        .63
TWO            .82           .65        .85        .81        .54
ONE            .64           .50        .67        .71        .30



Table 7 indicates there is little difference between the
reliability estimates if at least two observations are
averaged to create scores. However, there is a considerable
loss of reliability when only a single observer is used.
These results suggest the use of at least two observers for
evaluations.
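
The pattern of diminishing returns from adding observers can be illustrated with the Spearman-Brown formula from classical test theory. This is not the method used in the study, which recomputed the ANOVA based estimates for each team size; it is offered only as a rough cross-check, and the single-observer value of .55 is used purely as an illustrative input.

```python
def spearman_brown(r_single, k):
    """Projected reliability of the mean of k parallel observers."""
    return k * r_single / (1 + (k - 1) * r_single)

r_single = 0.55  # illustrative single-observer estimate
for k in (1, 2, 3):
    print(k, round(spearman_brown(r_single, k), 2))
```

The projection rises sharply from one observer to two and only slightly from two to three, which is the same shape as the trend described above.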
Effects of Number of Visits per Rater Team.

TABLE 8 

Reliability Estimates for Three Visits by One Team 



               TOTAL SCALE   DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5
INDICATORS     20 ITEMS      4 ITEMS    9 ITEMS    4 ITEMS    3 ITEMS

DISCRIMINANT
POSITIVE       .83           .39        .84        .85        .52
NEGATIVE       .68           .41        .44        .71        .51

STABILITY
POSITIVE       .90           .4?        .90        .92        .52
NEGATIVE       .87           .76        .79        .83        .48

INTERRATER
POSITIVE       .87           .51        .90        .88        .69
NEGATIVE       .75           .55        .51        .73        .54

? digit illegible in the source copy



Table 8 shows the effects of increasing the number of visits 
by each team. The reliability coefficients do increase over 
those obtained for two visits, but not substantially. This 
indicates that two visits by one team of two observers is 
probably optimum, with little gain from an increased number 
of visits. 

Domain and Subscale Independence. Theoretically, effective
and ineffective indicators in a specific domain should not
correlate highly with each other. In addition, for the
entire instrument and all subscales, a low correlation is
expected between negative and positive subscale scores. To
test for the independence of effective and ineffective
subtest scores on the summative instrument, all
intercorrelations between these domain scores were computed. The
results are shown in Table 9.
TABLE 9

Relationship of Effective to Ineffective Domain Scores



POSITIVE                  NEGATIVE DOMAINS
DOMAINS        DOMAIN 2   DOMAIN 3   DOMAIN 4   DOMAIN 5

DOMAIN 2       -.111      .079       .189       .232
DOMAIN 3       .339       .107       .127       .398
DOMAIN 4       .359       .278       -.144      .140
DOMAIN 5       .218       .187       .003       .287



Most of the correlations are very low; only negative Domain
2 with positive Domains 3 and 4 (Rs=.33), and negative Domain
5 with positive Domains 3 and 5 (R=.39, .29), show any
significant relationship, and then of a moderate degree. Only
Domain 5 shows a significant positive correlation between 
effective and ineffective domain scores, and this is not 
surprising since Domain 5 (Communication), is involved in 
all teacher behaviors. Thus one may conclude that the posi- 
tive and negative indicators are relatively independent. 

Intercorrelations Among Positive Domain Scores.

As Table 10 shows, all domain scores correlate positively 
with the Total score, and indicate at least a moderate posi- 
tive relationship to each other. Three relationships appear 
of particular interest: the Domain 3 score correlates highly 
with the Total score (r=.93), there is a negative
correlation between Domain 2 and 4, and Domain 5 (Communication)
correlates highly with all other scores except Domain 2. 

Thus it appears from the moderate positive intercorrelations
among these scales that the domain scores are moderately
related, but not identical. This strengthens the case that
Domain 3, consisting as it does of 45% (9) of the total
items on the instrument, dominates the relationships among
domains and total scores. This indicates that items in
Domain 3 may be the best discriminators across teachers.

TABLE 10

Intercorrelations Among Positive Domain Scores



         TOTAL    DOM2     DOM3     DOM4

DOM2     .359     1.000
DOM3     .932     .268     1.000
DOM4     .543     -.285    .339     1.000
DOM5     .789     .167     .635     .500



Item Analysis. The ability of individual items to
discriminate between "high scoring" teachers and "low scoring"
teachers was tested in the following fashion:

1. Two groups (HI and LOW) were created to determine
each item's ability to discriminate between teachers
receiving high scores on the scale and those
receiving low scores:

a) Group 1 - consisted of the top six tapes on the
summed total score.

b) Group 2 - consisted of the bottom six tapes on the
summed total score.

i) By conducting t-tests between groups 1 and
2, we are able to estimate an item's ability
to discriminate between effective and
ineffective teachers as defined by this scale.

2. Since an item should discriminate between high and
low scoring teachers, yet not discriminate between
the same teacher on two separate occasions, the
following pair of groups was created:

a) Group A - consisted of each teacher's first
lesson, and

b) Group B - consisted of each teacher's second
lesson.

i) By conducting t-tests between groups A and
B, we are able to determine whether an item
fallaciously discriminates the same teacher
from herself on two separate occasions.

Ideally, an item will indicate significant differences
between groups 1 and 2, and no significant differences between
groups A and B.
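
The group comparison for a single item can be sketched as below. The report does not say which form of the t-test was used, so the pooled-variance, independent-groups version is an assumption; the function name and the item scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group1, group2):
    """Two-sample t statistic with a pooled variance estimate (assumed form)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical normalized scores on one item for the six HI and six LOW tapes.
hi_tapes = [6.1, 5.8, 6.4, 5.9, 6.2, 6.0]
low_tapes = [4.2, 4.5, 3.9, 4.4, 4.1, 4.0]
print(round(pooled_t(hi_tapes, low_tapes), 2))
```

A desirable item would give a large value for the HI versus LOW comparison and a value near zero when the same function is applied to the first-lesson and second-lesson groups.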

As Table 11 shows, most of the items discriminate between HI
and LOW scoring teachers, yet do not discriminate between
the same teacher on two different lessons. Only items 19
and 20 (Domain 2) appear to erroneously discriminate between
two lessons by the same teacher. Reasons for this effect
are as yet unknown; however, these items are somewhat
different from the others, as noted under Scaling Concerns.

These results again support the superior discrimination
of items in Domain 3, as all nine items attained a t-value
of 6.41 or greater. Only five other items, two in Domain 2,
achieved this level.
TABLE 11

Ability of Individual Items to Discriminate Among Teachers



         T-VALUE BETWEEN        T-VALUE BETWEEN
ITEM     GRPS 1 AND 2           GRPS A AND B
         (HI vs LOW GROUPS)     (SAME TEACHER, TWO LESSONS)

1        1.00 *                 .23
2        10.85                  .16
3        1?.19                  1.39
4        10.05                  ?
5        13.21                  1.02
6        10.48                  .75
7        6.79                   .41
8        7.15                   3.60
9        8.19                   .28
10       6.41                   .12
11       6.71                   1.?6
12       1.00 *                 .14
13       4.25                   1.03
14       1.98 *                 1.14
15       3.18                   .62
16       1.00 *                 1.64
17       10.18                  .75
18       6.59                   .54
19       6.77                   7.93
20       7.40                   8.09

* t value not significant at the .01 level
** t value significant (p<.01)
? value or digit illegible in the source copy
Internal Consistency Reliability Estimates. Although the scales
developed from this instrument are not designed to be
homogeneous, internal consistency often influences later
research conducted using a specific instrument. For this
reason, Coefficient Alpha estimates of internal consistency
(Cronbach, 1951) were computed for all five positive
(effective) scales for both raw and normalized data using the
SPSS Reliability subprogram (release #9). Table 12 shows that
the normalized items produce a higher estimate of internal
consistency than do raw scores for every scale except
Domain 2 (.40 vs. .44). The total score estimate
(.69) is encouraging for a scale of this type, which contains
several apparently independent subscales. This, in concert
with the results of Item Discrimination, indicates that the
scale will probably correlate with other reliable measures
dealing with teacher behavior, and be capable of
differentiating between different groups.
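
For readers who wish to reproduce this kind of estimate outside SPSS, Coefficient Alpha can be sketched as below. This is an illustrative computation from the standard formula, not the SPSS subprogram itself; the function name `cronbach_alpha` and the sample scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient Alpha for a list of observations.

    rows: one list of item scores per observation, all the same length.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores: three observations on a two-item scale.
print(round(cronbach_alpha([[1, 2], [2, 4], [3, 3]]), 3))
```

Alpha rises as the items covary more strongly, which is why a scale built from several apparently independent subscales would be expected to yield a moderate rather than a high value.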



TABLE 12

Internal Consistency Estimates for Five Scales

SCORE TYPE       TOTAL    DOM2    DOM3    DOM4    DOM5
(no. of items)    (20)     (4)     (9)     (4)     (3)

RAW SCORES         .53     .44     .49     .45     .20
NORMALIZED         .69     .40     .63     .58     .37




Comparison of Repeated and Independent Estimates. In addition
to the independent three way ANOVA conducted on the various
scale and subscale scores for the summative instrument, a
slightly different and generally more conservative estimate
of reliability was conducted using a three way repeated
measures ANOVA (Medley 1982). Table 13 compares the
obtained estimates based on both independent and repeated
measures ANOVAs. These estimates are for nine teams of three
observers each observing two lessons.

As expected, the independent estimates are generally higher
than the repeated estimates (for example, for the total score,
Discriminant r = .91 vs. r = .76). However, the repeated
estimates are quite high for this type of scale, providing
further support for the reliability of the scales contained
in this summative instrument.
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TABLE 13 

Repeated Measures vs Independent ANOVA Reliability Estimates

Nine Teams of Three Observers, Two Lessons

                Discriminant      Stability        Inter-team
SUBSCALE        Indep.  Repeat    Indep.  Repeat   Indep.  Repeat

Total  (20)      .91     .76       .92     .85      .98     .82
Dom 2   (4)      .44     .30       .45     .34      .95     .66
Dom 3   (9)      .68     .80       .88     .81      .98     .90
Dom 4   (4)      .91     .87       .94     .87      .98     .91
Dom 5   (3)      .63     .36       .63     .48      .94     .54




Factor Analysis. In an attempt to verify the structure of 
domain indicators, two separate factor analyses were con- 
ducted on the reliability study data, and compared to prior 
analysis conducted on training films using a preliminary 
form of the summative instrument. 

Since the total sample for the beginning teacher evaluation
reliability study consisted of 768 observations (9 teachers,
2 lessons, 42 observers), and there were only 20 items (only
the positive items are included, due to the inconsistency of
negative indicators), it was possible to divide the total
sample into two separate subsamples, thereby allowing for an
internal cross validation. One group was created from the
first lesson for each teacher, and a second from the second
lesson for each teacher.

Six factors were rotated during the analysis. Although the
factors were stable (consistent from one sample to the
other), no single factor appeared to account for a large
amount of the variance (Factor 1, at 15%, accounted for the
most). This suggests that the factors are consistent, and
located within the general domain structure of the summative
instrument. Tables 14 and 15 provide a simplified depiction
of the obtained factor loadings:



TABLE 14

Loadings for Factors One, Two and Three

ITEM   DOM     FACTOR #1       FACTOR #2       FACTOR #3
  #     #     GRP1   GRP2     GRP1   GRP2     GRP1   GRP2

  3     3      .44    .22      .21    .41
  4     3      .48    .32
  5     3      .82    .78
  6     3      .61    .75
  7     3      .80    .74
  2     3                      .62    .62
  8     3                      .66
  9     3                      .64
 10     3                      .41    .70
 11    2-3                     .33    .42
 17     5                      .42    .31
 12     4                                      .84    .75
 13     4                                      .80    .70
 14     4                                      .46    .25
 15     4                                             .51
 16     5                                      .42    .60
 20     2                                      .45






Factors 1 through 3 are more consistent across groups, more
consistent with the domain structures, and more heavily
weighted on important items than are factors 4 through 6.
All of the factors examined tend to locate within a specific
domain across both samples, and follow very closely the
results obtained from a preliminary version of the instrument.

1. Factor 1 appears to load primarily on items: #5 - 
asks single, factual questions; #6 - asks questions 
requiring analysis or reasoning; and #7 - recognizes 
response, amplifies, gives corrective feedback. Fac- 
tor 1 appears to be a questioning and response fac- 
tor. 
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TABLE 15 

Loadings for Factors Four, Five and Six

ITEM   DOM     FACTOR #4       FACTOR #5       FACTOR #6
  #     #     GRP1   GRP2     GRP1   GRP2     GRP1   GRP2

  1     2      .71    .75
  4     3      .51    .49
 16     5      .30    .30
 20     2      .35    .49
 11     2                      .65
 14     4                             .48
 15     5                      .50
 19     2                      .64    .81
  8     3                                      [?]    .61
 14     4                                      .4[?]  .28
 16     5      .30    .30                      .40
 17     5                                      .44    [?]
 18     5                                      .70    [?]




2. Factor 2 appears to load primarily on items: #2 - 
Handles materials in an orderly fashion, #9 - pro- 
vides for practice, and #10 - gives directions, as- 
signs, checks comprehension, etc. Thus factor 2 ap- 
pears to be an active interaction factor. 

3. Factor 3 consists almost exclusively of elements from 
Domain 4; the various forms for presentation of sub- 
ject matter. 

4. Factors 4, 5 and 6 cross domains and appear more am- 
biguous than factors 1 thru 3. They tend to locate 
within Domains 2 and 5, which themselves are more 
diffuse than Domains 3 and 4, at least as measured by 
this instrument. 

a) Factor 4 appears to deal with timing and momentum, 
loading primarily on items #1 - Begins Instruction 
Promptly, #4 - Conducts Beginning, Ending Review, 
and #20 - Maintains Instructional Momentum. 


b) Factor 5 appears to relate to physical presence
and active involvement, loading most heavily on
items #11 - Circulates and Assists Students, and
#19 - Stops Misconduct (analysis conducted on the
preliminary version of the instrument included
only these two items in a factor, and in these
analyses, they obtained the heaviest weights).

c) Factor 6 appears to be an enthusiasm factor,
loading as it does on items #17 - Expresses
Enthusiasm, and #18 - Uses Body Behaviors That Show
Interest.

These analyses indicate that the factors generated by this
study tend to support the overall domain structure of the
summative instrument. Since communication (Domain 5) is
involved in all teacher behavior, it is not surprising to find
it overlapping with factors 3, 4 and 6 in this analysis.
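
The consistency of a factor across the two subsamples can also be expressed as a single number. The sketch below uses a congruence coefficient (the cosine between two loading vectors); this is an illustrative technique only and is not the method used in the study, which compared the loadings by inspection. The function name `congruence` is hypothetical; the example values are the Factor 1 loadings for items 3 through 7 from Table 14.

```python
import math


def congruence(loadings_a, loadings_b):
    """Congruence coefficient between two vectors of factor loadings.

    Values near 1 indicate that the factor has the same pattern of
    loadings in both samples; values near 0 indicate no similarity.
    """
    dot = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(sum(a * a for a in loadings_a)
                     * sum(b * b for b in loadings_b))
    return dot / norm


# Factor 1 loadings for items 3, 4, 5, 6 and 7 (Table 14).
grp1 = [0.44, 0.48, 0.82, 0.61, 0.80]
grp2 = [0.22, 0.32, 0.78, 0.75, 0.74]
print(round(congruence(grp1, grp2), 2))
```

Only loadings reported for both groups can be compared this way, so factors with missing entries in Tables 14 and 15 would need the full loading matrices.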
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Limitations 



Because the study was designed to assure that each rater
observed the same teacher on the same two occasions, it is not
possible to estimate precisely the amount of variation that
may result from two or more raters observing a subject on
different occasions. It is easier to generalize to the
situation in which two raters observe a teacher
simultaneously on two different occasions (four total
observations).

In addition, by observing video tapes in a controlled situ- 
ation, distractions that are almost certain to interfere 
with actual classroom observations were not present. 

At present, there is no effective way to use the instrument
for evaluation purposes, as no information exists for
comparison to general teacher patterns of behavior. In its
current form, it may be used for identification of problem
areas and possible suggestions for remediation.
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Conclusions 



The results of this study tend to support the following con- 
clusions: 

1. The summative observation instrument appears
sufficiently reliable (.80 or above) for total scores and
major subscales.

a) Subscales for Domains 2 and 5, however, should be
used with caution.

2. A minimum of two observers should be involved in any
evaluation using this instrument.

3. Positive indicators appear to be more reliable than
negative indicators.

4. The structure identified within the Domains of teacher
behavior appears to be supported by Factor Analytic
results.

5. Domain 5 (Communication: Verbal and Nonverbal) appears
to overlap with the other three domains in every test
applied to the data.

6. Most items appear to discriminate between high scoring
and low scoring teachers, and not to discriminate
between two lessons by the same teacher. This result
indicates a degree of relative homogeneity for items in
the total instrument. Internal consistency results also
support this.

7. The effective and ineffective domains appear to be
relatively independent, with the exception of Domain
5, which overlaps all other domains.
General Recommendations

This reliability study has produced positive results and the
summative instrument appears sufficiently reliable to
produce consistent scores for teacher behavior when at least
two observers are used to gather data. At this point in the
investigation of the instrument, we may expect these results
to generalize to various teachers, various raters, and the
same teacher in more than one situation. Although
reliability estimates will probably be lower for actual
classroom application, they appear to be high enough that a
small to moderate loss in the "Real World" environment will
still yield useful results.



Specific Recommendations

Due to the lower reliability estimates resulting from the 
ineffective domains, we recommend that the highly reliable 
effective domain scores be used to identify general areas 
for remediation; followed by investigation of specific inef- 
fective observations as indicators of particular teacher 
practices that may require remediation. 

We recommend that standard normalized scores based on a
large sample of teachers be developed to provide a basis for
the evaluation of a specific teacher.



We recommend that at least two separate observations by at
least two observers be used as a basis for an individual's
scores in evaluation. The mean scores from two separate
observers will be compared to standard normalized scores.

We recommend that this instrument ultimately be used as a 
part of the evaluation and remediation process for individu- 
al teachers. 
Future Research



At the present time, the following topics are being investi- 
gated: 

1. INSTRUMENT VALIDATION, 

2. RATER (OBSERVER) CERTIFICATION, 

3. NORMING, and 

4. FRAME FACTORS. 



Validation 

In the future, research will be conducted to support
validity by relating scores on this instrument with both
ratings of teacher effectiveness, and achievement scores of
students.



Rater Certification

Prior to an individual's use of this instrument, they must
receive sufficient training to code indicators accurately and
reliably. This will probably be measured in the following
fashion: students will observe two tapes of the same teacher
teaching the same lesson (or a very similar lesson).
Their observations will be compared with an average
resulting from several trained raters (all tapes used in this
study have at least 27 observer records), and a correlation
will be computed. In addition, a correlation will be
computed between the individual's first observation and
his/her second observation. In this way, two measures of a
prospective rater's skills will be obtained (Frick and Semmel
1978):

1. ACCURACY - the relationship to a pre-established mas- 
ter score, and 

2. TEST-RETEST - the relationship between observations
of two almost identical situations at two different
times by the same rater.
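
The two certification measures above could be computed along the following lines. This is a sketch only, assuming a Pearson product-moment correlation over item tallies; the report does not specify the correlation to be used, and the function names and the sample tallies are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def certification_measures(first_obs, second_obs, master):
    """Accuracy and test-retest measures for a prospective rater.

    first_obs, second_obs: the trainee's item tallies for the two tapes.
    master: average item tallies from several trained raters.
    """
    return {
        "accuracy_first": pearson(first_obs, master),
        "accuracy_second": pearson(second_obs, master),
        "test_retest": pearson(first_obs, second_obs),
    }


# Hypothetical item tallies for a five-item excerpt.
result = certification_measures([3, 5, 2, 8, 4], [4, 5, 1, 7, 4],
                                [3.2, 5.1, 1.8, 7.6, 4.0])
print({name: round(value, 2) for name, value in result.items()})
```

Any certification cutoff applied to these correlations would have to be set by the program; none is given in this report.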
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Norming 

Prior to the use of this instrument for evaluation it will
be necessary to determine behavior patterns of currently
practicing teachers. This data must be viewed in light of
at least three factors:

1. Frame Factors - socio-demographic factors that appear
to affect student outcomes,

2. Behavior Patterns - specific behaviors will tend to
associate with certain other behaviors. Absolute
numbers of behaviors may not be as important as the
relationship among the specific behaviors.

3. Normalized (standardized) Scores - based perhaps upon
averages for specific certified observers across all
of their observations. It is assumed that some
observers will be lenient, while others will be strict
in their interpretation of specific indicators.
While the use of at least two observers should
control for this effect somewhat, even team scores will
tend to be either strict or lenient, and this must
also be considered in data interpretation. Scores
normalized across rater teams should provide the best
source of information concerning an individual
teacher's use of effective behaviors.
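
One way to adjust for lenient and strict observers is to standardize each observer's scores against that observer's own mean and standard deviation. The sketch below assumes a simple z-score within rater; the report does not define the normalization to be adopted, and the function name and sample data are hypothetical.

```python
from statistics import mean, pstdev


def normalize_by_rater(scores_by_rater):
    """Convert each rater's raw scores to z-scores within that rater.

    scores_by_rater: dict mapping a rater (or rater team) identifier to
    the list of raw scores that rater assigned across observations.
    """
    normalized = {}
    for rater, scores in scores_by_rater.items():
        center = mean(scores)
        spread = pstdev(scores)
        # A rater who gives identical scores has no spread to scale by.
        normalized[rater] = [
            (score - center) / spread if spread else 0.0
            for score in scores
        ]
    return normalized


# Hypothetical raw scores: a lenient rater and a strict rater who
# rank the same three teachers in the same order.
print(normalize_by_rater({"lenient": [40, 45, 50], "strict": [20, 25, 30]}))
```

After normalization the two raters in this example produce identical scores for each teacher, which is the effect the third factor above is intended to achieve.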



Frame Factors 

Presently, several variables that have shown historical
relationships with student outcome variables are being
obtained along with teacher scores on the instrument. These
include:

1. Academic status of students 

2. Socio-economic status of students 

3. Non-native language speakers 

4. Exceptionalities of students 

5. Class size 

6. Sex of students 

7. Classroom conditions 

8. Instructional material conditions

9. Teacher variables

9.1 Experience

9.2 Education

9.3 Tenure

9.4 Area of certification

10. Subject (math, English, etc.) 

11. Grade level (elementary, middle, high school) 
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Tests will be conducted to determine which if any of these 
factors significantly influence teacher behavior as measured 
by the summative instrument. Regression analysis will be 
used to identify the most "important" factors, and may re- 
sult in differential norms for different factor combina- 
tions. 
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Appendix A 

Knowledge Base and the Florida Teacher Competencies



The development of instruments for measuring classroom
performance of teachers requires that a body of information
about teaching be assembled. Such information can be
derived from two sources: first, the consensus of opinion of
informed persons such as teachers and pedagogical
instructors about the knowledge and skills believed to be
necessary for effective classroom performance; and second,
process-product and experimental research on teacher
effectiveness.

The Florida Competencies were derived from consensus of
opinion among informed school people. While these
competencies are useful for general purposes, the original
research team chose to turn to research literature as the
source of a knowledge base for instrument development. This
approach was selected for the following reasons:



1. The knowledge and skills derived from research
literature, as the knowledge base for evaluating teachers,
are easier to defend if contested.

2. The research studies indicate precisely what teacher
behaviors are positively associated with either
student achievement or student conduct or both.

3. The research literature enables one to cite specific
evidence in support of particular teacher
performance. The language of research studies is precise
and thus allows little chance for misinterpretation.

4. The research findings relevant to teacher
effectiveness provide grounds for an examination of the
Florida competencies.
Appendix B

Indicators of Summative Instrument

Indicators of teacher behavior are divided into two types,
EFFECTIVE (positive) and INEFFECTIVE (negative). The
following descriptors are used on the summative instrument to
specify essential characteristics of behaviors belonging to
a particular domain:



TABLE 16

Summative Instrument Descriptors - Domain 2

     POSITIVE INDICATORS        |      NEGATIVE INDICATORS
ITEM                            | ITEM
  #   DESCRIPTORS               |   #   DESCRIPTORS

  1   BEGINS INSTRUCTION        |   1   DELAYS
      PROMPTLY                  |

 11   CIRCULATES AND ASSISTS    |  10   REMAINS AT DESK/CIRCULATES
      STUDENTS                  |       INADEQUATELY

 19   STOPS MISCONDUCT          |  19   DELAYS DESIST/DOESN'T
                                |       STOP MISCONDUCT/DESISTS
                                |       PUNITIVELY

 20   MAINTAINS INSTRUCTIONAL   |  20   LOSES MOMENTUM/FRAGMENTS
      MOMENTUM                  |       NON-ACADEMIC DIRECTIONS,
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TABLE 17 

Summative Instrument Descriptors - Domain 3

     POSITIVE INDICATORS        |      NEGATIVE INDICATORS
ITEM                            | ITEM
  #   DESCRIPTORS               |   #   DESCRIPTORS

  2   HANDLES MATERIAL IN       |   2   DOES NOT ORGANIZE OR HANDLE
      AN ORDERLY MANNER         |       MATERIALS SYSTEMATICALLY

  3   ORIENTS STUDENTS TO       |   3   ALLOWS TALK/ACTIVITY
      CLASSWORK/MAINTAINS       |       UNRELATED TO SUBJECT
      ACADEMIC FOCUS            |

  4   CONDUCTS BEGINNING/       |
      ENDING REVIEW             |

  5   ASKS SINGLE/FACTUAL       |   4   POSES MULTIPLE QUESTIONS
      QUESTIONS                 |

  6   ASKS QUESTIONS            |   5   POSES NON-ACADEMIC QUESTION/
      REQUIRING ANALYSIS OR     |       QUESTIONS
      REASONING                 |

  7   RECOGNIZES RESPONSE/      |   6   IGNORES STUDENT OR RESPONSE/
      AMPLIFIES/GIVES           |       EXPRESSES SARCASM, DISGUST OR
      CORRECTIVE FEEDBACK       |       HARSHNESS

  8   GIVES SPECIFIC            |   7   USES GENERAL, NON-SPECIFIC
      ACADEMIC PRAISE           |       PRAISE

  9   PROVIDES FOR PRACTICE     |   8   EXTENDS DISCOURSE, CHANGES
                                |       TOPIC WITH NO PRACTICE

 10   GIVES DIRECTIONS/         |   9   GIVES INADEQUATE DIRECTIONS/
      ASSIGNS/CHECKS            |       NO HOMEWORK/NO FEEDBACK
      COMPREHENSION OF HOMEWORK,|
      SEATWORK ASSIGNMENT/      |
      GIVES FEEDBACK            |
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TABLE 18 

Summative Instrument Descriptors - Domain 4

     POSITIVE INDICATORS        |      NEGATIVE INDICATORS
ITEM                            | ITEM
  #   DESCRIPTORS               |   #   DESCRIPTORS

 12   TREATS CONCEPT/           |  11   GIVES DEFINITION OR EXAMPLES
      DEFINITION/ATTRIBUTES/    |       ONLY
      EXAMPLES/NON-EXAMPLES     |

 13   DISCUSSES CAUSE-EFFECT/   |  12   DISCUSSES EITHER CAUSE OR
      USES LINKING WORDS/       |       EFFECT ONLY/USES NO LINKING
      APPLIES LAW OR PRINCIPLE  |       WORDS

 14   STATES AND APPLIES        |  13   DOES NOT STATE OR DOES NOT
      ACADEMIC RULE             |       APPLY ACADEMIC RULE

 15   DEVELOPS CRITERIA AND     |  14   STATES VALUE JUDGEMENT WITH
      EVIDENCE FOR VALUE        |       NO CRITERIA OR EVIDENCE
      JUDGEMENT                 |




TABLE 20 

Summative Instrument Descriptors - Domain 5

     POSITIVE INDICATORS        |      NEGATIVE INDICATORS
ITEM                            | ITEM
  #   DESCRIPTORS               |   #   DESCRIPTORS

 16   EMPHASIZES IMPORTANT      |  15   USES VAGUE/
      POINTS                    |       SCRAMBLED DISCOURSE

 17   EXPRESSES ENTHUSIASM      |  16   USES LOUD-GRATING, HIGH
      VERBALLY/CHALLENGES       |       PITCHED, MONOTONE, INAUDIBLE
      STUDENT                   |       TALK

 18   USES BODY BEHAVIOR        |  18   FROWNS, DEADPAN OR
      THAT SHOWS INTEREST/      |       LETHARGIC/
      SMILES, GESTURES          |       OVERDWELLS
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Appendix C 

Computation of Reliability Estimates



The following tables indicate the use of Sums of Squares
generated by a Three Way ANOVA on the Total Score (20 items)
to produce three reliability outputs: (1) discriminant, (2)
stability, and (3) interrater. For the development of these
formulations, see Medley and Mitzel (1963) and Shrout (1979).



TABLE 21

Sources of Variance for Reliability Estimates

SOURCE              N    DF    SS'S             MEAN SQ    LETTER

TEACHER             9     8    3435.69118306    429.4610     a
LESSON              2     1      18.28686575     18.2869     b
RATERS              9     8     213.04918227     26.6310     c
TEACH*LESS                8     273.17772720     34.1472     d
TEACH*RATE               64     540.44689727      8.4444     e
LESS*RATE                 8      83.17933286     10.3974     f
RATE*LESS*TEACH          64     277.97843948      4.3434     g




Using the mean squares shown above, the independent ANOVA 
estimates are produced in the following fashion: 
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TABLE 22 
Computations for Total Scale

ONE TEAM OF RATERS
TWO LESSONS
NINE TEACHERS

K  = NUMBER OF LESSONS (SITUATIONS) IN ESTIMATE = 2
N  = NUMBER OF RATER TEAMS IN SOURCE STUDY = 9
N1 = NUMBER OF RATER TEAMS IN THIS ESTIMATE = 1

A, D, E and G are the mean squares lettered a, d, e and g
in Table 21.

TRUE SCORE = (K*N1)(A - D - E + G)/2N
           = 2(429.461 - 34.147 - 8.444 + 4.343)/18
           = 43.468

ERROR SCORE = (N1*(D - G))/N + K(E - G)/2 + G
            = (34.147 - 4.343)/9 + 2(8.444 - 4.343)/2 + 4.343
            = 11.756

ERROR(S) STABILITY = (N1*(D - G))/N + G
                   = (34.147 - 4.343)/9 + 4.343
                   = 7.655

ERROR(R) RATER = K(E - G)/2 + G
               = 2(8.444 - 4.343)/2 + 4.343
               = 8.444

OBTAINED SCORE = TRUE + ERROR
               = 43.468 + 11.756
               = 55.224

r (Discriminant) = 1 - ERROR/OBTAINED
                 = 1 - 11.756/55.224
                 = .787

r (Stability) = 1 - ERROR(S)/OBTAINED
              = 1 - 7.655/55.224
              = .861

r (Interrater) = 1 - ERROR(R)/OBTAINED
               = 1 - 8.444/55.224
               = .847
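
The computations in Table 22 can be checked with a short script. The sketch below simply transcribes the formulas as printed in this appendix, using the mean squares from Table 21; the function name `reliability_estimates` is hypothetical, and the formulas have only been checked here against the one-team, two-lesson case shown above.

```python
def reliability_estimates(a, d, e, g, k=2, n=9, n1=1):
    """Reliability estimates from three-way ANOVA mean squares.

    a: TEACHER mean square          d: TEACH*LESS mean square
    e: TEACH*RATE mean square       g: RATE*LESS*TEACH mean square
    k: lessons in the estimate      n: rater teams in the source study
    n1: rater teams in this estimate
    """
    true_score = (k * n1) * (a - d - e + g) / (2 * n)
    error_stability = n1 * (d - g) / n + g
    error_rater = k * (e - g) / 2 + g
    error_total = n1 * (d - g) / n + k * (e - g) / 2 + g
    obtained = true_score + error_total
    return {
        "discriminant": 1 - error_total / obtained,
        "stability": 1 - error_stability / obtained,
        "interrater": 1 - error_rater / obtained,
    }


# Mean squares for the Total Score (20 items), taken from Table 21.
estimates = reliability_estimates(a=429.4610, d=34.1472, e=8.4444, g=4.3434)
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Run with the Table 21 values, the script gives .787, .861 and .847, matching the three estimates in Table 22.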
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